Abstract. We derive the posterior contraction rate for non-parametric Bayesian
estimation of the intensity function of a Poisson point process.

1. Introduction
Poisson point processes (see e.g. Kingman (1993)) are among the basic modelling
tools in areas as different as astronomy, biology, image analysis, reliability theory,
medicine, physics, and others. A Poisson point process $X$ on the space
$\mathcal{X} = [0,1]^d$ (this is good enough for our purposes) with the Borel
$\sigma$-field $\mathcal{B}(\mathcal{X})$ of its subsets is a random integer-valued
measure on $\mathcal{X}$ (we assume the underlying probability space
$(\Omega, \mathcal{F}, \mathbb{Q})$ in the background), such that

(i) for any disjoint subsets $B_1, B_2, \ldots, B_m \in \mathcal{B}(\mathcal{X})$, the random variables
$X(B_1), X(B_2), \ldots, X(B_m)$ are independent, and
(ii) for any $B \in \mathcal{B}(\mathcal{X})$, the random variable $X(B)$ is Poisson distributed with
parameter $\Lambda(B)$, where $\Lambda$ is a finite measure on $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$, called the
intensity measure of the process $X$.

Intuitively, the process $X$ can be thought of as random scattering of points in $\mathcal{X}$,
where scattering occurs in a special way determined by properties (i)-(ii) above.
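
Properties (i)-(ii) suggest a direct way to simulate such a process. The following is a minimal Python sketch for $d = 1$, assuming the intensity function is bounded above by a known constant. It uses the standard thinning construction, which is not part of this note, and the names `simulate_poisson_process`, `example_intensity` and `lam_max` are hypothetical.

```python
import math
import random


def simulate_poisson_process(intensity, lam_max, rng):
    """Sample the points of a Poisson process on [0, 1].

    Assumes 0 <= intensity(x) <= lam_max for all x in [0, 1].
    """
    points = []
    # Arrivals of a homogeneous process with rate lam_max have
    # independent exponential gaps.
    x = rng.expovariate(lam_max)
    while x <= 1.0:
        # Thinning: keep the candidate with probability intensity(x) / lam_max.
        if rng.random() * lam_max <= intensity(x):
            points.append(x)
        x += rng.expovariate(lam_max)
    return points


def example_intensity(x):
    # Hypothetical intensity, bounded away from zero; it integrates to 20 on [0, 1].
    return 20.0 + 10.0 * math.sin(2.0 * math.pi * x)


rng = random.Random(0)
counts = [
    len(simulate_poisson_process(example_intensity, 30.0, rng))
    for _ in range(2000)
]
mean_count = sum(counts) / len(counts)
# By property (ii), the total count has mean Lambda([0, 1]) = 20.
print(mean_count)
```

Averaging the total number of points over many replications gives a quick check of property (ii) for $B = \mathcal{X}$.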

In practical applications knowledge of the intensity $\Lambda$ is of importance. The
latter typically cannot be assumed known beforehand and has to be estimated
based on the observational data on the process $X$. A popular assumption in the
literature (see e.g. the references on p. 263 in Kutoyants (1998)) is that one has
independent observations $X_1, \ldots, X_n$ on the process $X$ over $\mathcal{X}$ at one's disposal, on
the basis of which an estimator of $\Lambda$ has to be constructed. We will denote for brevity
$X^{(n)} = (X_1, X_2, \ldots, X_n)$. In case $\Lambda$ is absolutely continuous with respect to some
dominating measure and has a density $\lambda$, one might also be interested in estimation
of $\lambda$. We will assume that $\Lambda$ is absolutely continuous with respect to the Lebesgue
measure on $\mathcal{X}$ and will call $\lambda$ the intensity function.

From now on we concentrate on estimation of the intensity function. Two broad
approaches to estimation of $\lambda$, parametric and non-parametric, can be discerned in
the literature. In the parametric approach, one assumes that the unknown intensity
function $\lambda$ can be parametrised by a finite-dimensional parameter $\theta$ (where, for
instance, $\theta$ ranges in some subset $\Theta$ of $\mathbb{R}^p$), so that $\lambda = \lambda_\theta$, and the corresponding
statistical experiment generated by $X^{(n)}$ is denoted by
$(\mathcal{X}^n, \mathcal{B}(\mathcal{X}^n), \{P_\theta^{(n)}, \theta \in \Theta\})$.
The goal is to estimate the 'true' parameter $\theta_0$ on the basis of the sample $X^{(n)}$.
In the non-parametric approach to estimation of $\lambda$ no such assumptions are made.
Instead, one assumes, for instance, that $\lambda$ belongs to some class $\Theta$ of functions
possessing given smoothness properties (the statistical experiment generated by
$X^{(n)}$ is $(\mathcal{X}^n, \mathcal{B}(\mathcal{X}^n), \{P_\lambda^{(n)}, \lambda \in \Theta\})$), and the goal is to estimate the 'true' intensity
function $\lambda_0$. See e.g. Kutoyants (1998) for additional information on statistical
inference for Poisson point processes from the point of view of asymptotic statistical
theory. Computational approaches are studied e.g. in Møller and Waagepetersen
(2004) and are reviewed in Møller and Waagepetersen (2007).



In this note we are interested in non-parametric estimation of the intensity
function $\lambda_0$. A kernel-type estimator of $\lambda_0$ has been studied in detail in Section 6.2 in
Kutoyants (1998), see also p. 263 there for further references. In particular, it is
shown in Kutoyants (1998) that this estimator is optimal in the minimax sense over
the class of $\beta$-Hölder-regular intensity functions.

Here we will take an alternative, non-parametric Bayesian approach to estimation
of $\lambda_0$, but will analyse it from the frequentist point of view. In the Bayesian
approach to estimation of $\lambda_0$ one puts a prior $\Pi$ on $\lambda_0$, which might be thought of
as reflecting one's prior knowledge or belief in $\lambda_0$. In more formal terms this is a
measure $\Pi$ defined on the parameter set $\Theta$ equipped with some $\sigma$-field $\sigma(\Theta)$, and
one assumes that $\lambda_0 \in \Theta$. The set $\Theta$ equipped with a certain $\sigma$-field $\sigma(\Theta)$ is a set of
finite-valued functions defined on $[0,1]^d$, which we for technical reasons assume to
be uniformly bounded away from zero. Then by Theorem 1.3 in Kutoyants (1998),
for any $\lambda \in \Theta$, the law $P_\lambda$ of $X$ under the parameter value $\lambda$ admits a density $p_\lambda$
with respect to the measure $P_{\mathrm{sp}}$ induced by a standard Poisson point process with
intensity measure $\Lambda_{\mathrm{sp}}(dx) = dx$. This density is given by



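$$p_\lambda(\xi) = \exp\left( \int_{[0,1]^d} \log \lambda(x)\, \xi(dx) - \int_{[0,1]^d} [\lambda(x) - 1]\, dx \right),$$

where $\xi = \sum_{i=1}^m \delta_{x_i}$ is a realisation of $X$ (here $\delta_{x_i}$ denotes the Dirac measure at $x_i$)
and

$$\int_{[0,1]^d} \log \lambda(x)\, \xi(dx) = \sum_{i=1}^m \log(\lambda(x_i)).$$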

Using independence of the $X_i$'s, it follows that the likelihood $L_\lambda(X^{(n)})$ for $X^{(n)}$ can be
written as

$$L_\lambda(X^{(n)}) = \prod_{i=1}^n \exp\left( \int_{[0,1]^d} \log \lambda(x)\, X_i(dx) - \int_{[0,1]^d} [\lambda(x) - 1]\, dx \right). \tag{1}$$
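
As a rough numerical illustration of the likelihood (1), the sketch below evaluates its logarithm for $d = 1$, with the integral of $\lambda - 1$ approximated by a midpoint rule. The function name `log_likelihood`, the grid size and the toy data are assumptions made for illustration only.

```python
import math


def log_likelihood(intensity, samples, grid_size=1000):
    """Logarithm of the likelihood (1) for d = 1.

    `samples` is a list of point patterns, one list of points in [0, 1]
    per independent observation X_i.
    """
    h = 1.0 / grid_size
    # Midpoint-rule approximation of the integral of [lambda(x) - 1] over [0, 1].
    integral = sum(intensity((k + 0.5) * h) - 1.0 for k in range(grid_size)) * h
    total = 0.0
    for points in samples:
        # The integral of log(lambda) against X_i is a sum over the points of X_i.
        total += sum(math.log(intensity(x)) for x in points) - integral
    return total


# Toy data: two observed patterns and a constant intensity equal to 2.
toy_samples = [[0.1, 0.5], [0.3]]
value = log_likelihood(lambda x: 2.0, toy_samples)
print(value)
```

For a constant intensity $c$ the expression reduces to $N \log c - n(c - 1)$, where $N$ is the total number of observed points, which gives a simple way to check the code.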

Assuming joint measurability of $p_\lambda(\xi)$ in $(\xi, \lambda)$, so that the integrals below make
sense, Bayes' formula gives the posterior measure of any measurable set $A \in \sigma(\Theta)$
through

$$\Pi(A \mid X^{(n)}) = \frac{\int_A L_\lambda(X^{(n)})\, d\Pi(\lambda)}{\int_\Theta L_\lambda(X^{(n)})\, d\Pi(\lambda)}.$$

Transition from the prior to the posterior can be thought of as updating our prior
opinion on $\lambda_0$ upon seeing the data $X^{(n)}$.
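
To make the prior-to-posterior update concrete, here is a toy sketch in which the prior is supported on finitely many constant intensities on $[0,1]$, so that the integrals in Bayes' formula become finite sums. This finite prior, the function name `posterior_weights` and the data are illustrative assumptions, not the prior studied in this note.

```python
import math


def posterior_weights(candidates, prior, counts):
    """Posterior over a finite set of constant intensities on [0, 1].

    `counts[i]` is the number of points in the i-th observation X_i.
    """
    n = len(counts)
    total_points = sum(counts)
    # For a constant intensity c, the log of (1) is
    # total_points * log(c) - n * (c - 1).
    log_post = [
        math.log(p) + total_points * math.log(c) - n * (c - 1.0)
        for c, p in zip(candidates, prior)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_post)
    unnormalised = [math.exp(v - shift) for v in log_post]
    norm = sum(unnormalised)
    return [u / norm for u in unnormalised]


candidates = [10.0, 20.0, 30.0]
prior = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
counts = [18, 22, 20, 19, 21]
weights = posterior_weights(candidates, prior, counts)
print(weights)
```

With these counts, which average 20 points per observation, nearly all posterior mass moves to the candidate intensity 20.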

We will study the rate of convergence of the posterior distribution $\Pi(\cdot \mid X^{(n)})$
under $P_{\lambda_0}^{(n)}$, where $P_{\lambda_0}^{(n)}$ denotes the law of $X^{(n)}$ under the true parameter value $\lambda_0$.



NON-PARAMETRIC ESTIMATION FOR POISSON POINT PROCESSES 3 

The idea, informally speaking, is that with the sample size $n$ growing indefinitely,
the Bayesian approach should be able to recognise the true $\lambda_0$ with increasing
accuracy. This can be formalised by requiring, for instance, that for any fixed
neighbourhood $A$ of $\lambda_0$, $\Pi(A^c \mid X^{(n)}) \to 0$ in $P_{\lambda_0}^{(n)}$-probability, or, in words, by
requiring that with the Bayesian approach with a prior $\Pi$, most of the posterior mass
must eventually concentrate around the true parameter value $\lambda_0$. More generally,
one might take a sequence of shrinking neighbourhoods $A_n$ of $\lambda_0$ and ask what is
the fastest rate, at which the neighbourhoods $A_n$ can shrink, while still capturing
most of the posterior mass (the precise definition will be given below). The
case for such an approach to the study of non-parametric Bayesian techniques is
made e.g. in Diaconis and Freedman (1986), while several recent references dealing
with establishing posterior convergence rates under broad conditions in various
statistical settings are Ghosal et al. (2000), Ghosal and van der Vaart (2001) and
van der Vaart and van Zanten (2008a). The rate, at which the neighbourhoods $A_n$
shrink, can be thought of as an analogue of the convergence rate of a frequentist
estimator. The analogy can be made precise in the sense that contraction of the
posterior distribution at a certain rate implies existence of a Bayes point estimate
with the same convergence rate (in the frequentist sense); see e.g. Theorem 2.5 in
Ghosal et al. (2000) and the discussion on pp. 506-507 there.



The rest of the paper is organised as follows: in the next section we state the
problem we are interested in in greater detail and provide a general result on the
posterior contraction rate in our problem with the prior based on a transformation
of a Gaussian process. In Section 3 we consider a concrete example of the prior and
compute the posterior contraction rate explicitly. Finally, Appendix A contains the
proof of the technical lemma used in the proof of our main theorem. Computational
aspects of the non-parametric Bayesian approach to the intensity function estimation
lie outside the scope of this note. Instead we refer to Heikkinen and Arjas
(1998) for one specific implementation.

2. Main result 

In order to study the contraction rate of the posterior distribution in our setting,
we first need to specify the suitable neighbourhoods $A_n$ of $\lambda_0$, for which this will
be done. The Hellinger distance $h(P_{\lambda_1}, P_{\lambda_2})$ between two probability laws $P_{\lambda_1}$ and
$P_{\lambda_2}$ is defined as

$$h(P_{\lambda_1}, P_{\lambda_2}) = \left( \int \left( \sqrt{dP_{\lambda_1}} - \sqrt{dP_{\lambda_2}} \right)^2 \right)^{1/2} = \left( \int \left( p_{\lambda_1}^{1/2} - p_{\lambda_2}^{1/2} \right)^2 dP_{\mathrm{sp}} \right)^{1/2}.$$

Here, as in Section 1, we assume that the $\lambda_i$'s are bounded away from zero and infinity,
which yields in particular the second equality in the above display. The Hellinger
distance is one of the popular discrepancy measures between two probability laws.
The Hellinger distance can also be used to define the pseudo-distance $\rho(\lambda_1, \lambda_2)$
between parameters $\lambda_1$ and $\lambda_2$ by setting

$$\rho(\lambda_1, \lambda_2) = h(P_{\lambda_1}, P_{\lambda_2}).$$

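Thus $\lambda_1$ and $\lambda_2$ are close to each other if the corresponding laws $P_{\lambda_1}$ and $P_{\lambda_2}$ are
close in Hellinger distance. We also introduce two further discrepancy measures: the
Kullback-Leibler divergence $KL(P_{\lambda_1}, P_{\lambda_2})$ between two probability laws $P_{\lambda_1}$ and
$P_{\lambda_2}$ is defined as

$$KL(P_{\lambda_1}, P_{\lambda_2}) = \int \log\left( \frac{dP_{\lambda_1}}{dP_{\lambda_2}} \right) dP_{\lambda_1} = \int p_{\lambda_1} \log\left( \frac{p_{\lambda_1}}{p_{\lambda_2}} \right) dP_{\mathrm{sp}},$$

while the discrepancy $V$ is defined through

$$V(P_{\lambda_1}, P_{\lambda_2}) = \int \left( \log\left( \frac{dP_{\lambda_1}}{dP_{\lambda_2}} \right) - KL(P_{\lambda_1}, P_{\lambda_2}) \right)^2 dP_{\lambda_1} = \int p_{\lambda_1} \left( \log\left( \frac{p_{\lambda_1}}{p_{\lambda_2}} \right) - KL(P_{\lambda_1}, P_{\lambda_2}) \right)^2 dP_{\mathrm{sp}}.$$

This can be thought of as the Kullback-Leibler 'variance'. Both quantities are
well-defined, because under our standing assumption that $\lambda_1$ and $\lambda_2$ are bounded away
from zero, the corresponding laws $P_{\lambda_1}$ and $P_{\lambda_2}$ are equivalent.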

We will derive the posterior convergence rate by taking the neighbourhoods $A_n$
of $\lambda_0$ to be balls of appropriate radii in the pseudo-distance $\rho$, see below. This is a
reasonable choice, see e.g. Ghosal et al. (2000).



We need to specify the prior $\Pi$. Priors based on stochastic processes are widely
used in Bayesian statistics. In particular, priors based on Gaussian processes are
a popular choice both in the statistics and machine learning communities, see e.g.
Rasmussen and Williams (2006), as well as van der Vaart and van Zanten (2008a)
for additional references. For our purposes, a zero-mean Gaussian process
$W = (W_x)_{x \in \mathcal{X}}$ is a collection of random variables $W_x$ indexed by $\mathcal{X}$ and defined on the
common probability space $(\Omega, \mathcal{F}, P)$, such that the finite-dimensional distributions
of $W$ are zero-mean multivariate normal distributions. The latter are determined
by the covariance function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, defined by

$$K(x, y) = E[W_x W_y], \quad x, y \in \mathcal{X},$$

where $E$ denotes the expectation with respect to the measure $P$. For all the
necessary definitions and properties of Gaussian processes with a view towards
applications in non-parametric Bayesian statistics that are used in this work, see
van der Vaart and van Zanten (2008a) and van der Vaart and van Zanten (2008b).



Assume that $W$ is a zero-mean Gaussian process with bounded sample paths
$x \mapsto W_x$ and let $\kappa > 0$ be a fixed constant. Define the process
$Z^{(W)} = (Z_x^{(W)})_{x \in \mathcal{X}}$ through

$$Z_x^{(W)} = \kappa + |W_x|, \quad x \in \mathcal{X}. \tag{2}$$
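
A draw from a process of the form (2) is easy to generate once a Gaussian process $W$ is fixed. The sketch below takes $W$ to be a standard Brownian motion on $[0,1]$, discretised on a regular grid; this choice of $W$, the function name `sample_prior_path` and the parameter values are assumptions made purely for illustration.

```python
import math
import random


def sample_prior_path(kappa, grid_size, rng):
    """Draw one path of Z_x = kappa + |W_x| on a regular grid of [0, 1].

    W is taken to be a standard Brownian motion (an illustrative choice).
    """
    step = 1.0 / grid_size
    w = 0.0
    path = [kappa]  # W_0 = 0, so Z_0 = kappa
    for _ in range(grid_size):
        # Brownian increments over a step are independent N(0, step) variables.
        w += rng.gauss(0.0, math.sqrt(step))
        path.append(kappa + abs(w))
    return path


rng = random.Random(1)
z = sample_prior_path(kappa=0.1, grid_size=500, rng=rng)
print(min(z), max(z))
```

Every value of the sampled path is at least $\kappa$, so the draw is a valid intensity function that is bounded away from zero.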

Realisations of $W$ will be denoted by lowercase letters, such as $w$ and $v$. The
corresponding realisations of $Z^{(W)}$ will be denoted by $z^{(w)}$ and $z^{(v)}$. Our prior $\Pi$
will be the law of the process $Z^{(W)}$, which implicitly defines our parameter set $\Theta$.
The only reason for using the constant $\kappa > 0$ in the definition of the process $Z^{(W)}$ is
to make its sample paths strictly positive, which allows one to avoid complications
in the definition of the likelihood (1). The constant $\kappa$ can be taken to be arbitrarily
small. Note that the process $W$ can be viewed as a map with values in the Banach
space $\ell^\infty(\mathcal{X})$. In applications, sample paths of $W$ typically possess some smoothness
properties and $W$ can also be viewed as a map taking values in a Banach space
$(\mathbb{B}, \|\cdot\|_\infty)$ for some $\mathbb{B} \subset \ell^\infty(\mathcal{X})$. We will assume that this map is Borel-measurable, so that
$W$ is a $\mathbb{B}$-valued random element. By Lemma 5.1 in van der Vaart and van Zanten
(2008b), the support of $W$, i.e. the smallest closed set $\mathbb{B}_0 \subset \mathbb{B}$, such that
$P(W \in \mathbb{B}_0) = 1$, is the closure in $\mathbb{B}$ of the reproducing kernel Hilbert space (RKHS)
$(\mathbb{H}, \|\cdot\|_{\mathbb{H}})$ attached to $W$. It can be shown that this RKHS can be identified with the
completion of the set of maps

$$x \mapsto \sum_{i=1}^k a_i K(y_i, x) = E[W_x H], \quad H = \sum_{i=1}^k a_i W_{y_i},$$

under the inner product

$$\langle E[W_\cdot H_1], E[W_\cdot H_2] \rangle_{\mathbb{H}} = E[H_1 H_2].$$

Here the $a_i$'s range over $\mathbb{R}$ and $k$ ranges over $\mathbb{N}$. The support of the process $Z^{(W)}$
can then be described through this characterisation of the support of the process
$W$.

Remark 1. Other transformations of the process $W$ can also be used to define the
process $Z^{(W)}$. For instance, one can set $Z_x^{(W)} = g(W_x)$ for a fixed function $g$ that
is bounded away from zero and possesses suitable regularity properties. □

Let $N(\varepsilon, B, f)$ denote the minimum number of balls of radius $\varepsilon$ needed to cover
a subset $B$ of a metric space with metric $f$. This is the $\varepsilon$-covering number of $B$.
Our main result is based on an application of Theorem 2.1 in Ghosal and van der Vaart
(2001) (which is a slight modification of Theorem 2.1 in Ghosal et al. (2000)) and
Theorem 2.1 from van der Vaart and van Zanten (2008a). These are provided below
for the reader's convenience in an adapted form.



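Theorem 1 (Ghosal and van der Vaart (2001)). Suppose that for positive sequences
$\bar\varepsilon_n, \tilde\varepsilon_n \to 0$, such that $n \min(\bar\varepsilon_n^2, \tilde\varepsilon_n^2) \to \infty$, constants $c_1, c_2, c_3, c_4 > 0$ and sets
$\Theta_n \subset \Theta$, we have

$$\log N(\bar\varepsilon_n, \Theta_n, \rho) \le c_1 n \bar\varepsilon_n^2, \tag{3}$$

$$\Pi(\Theta \setminus \Theta_n) \le c_3 e^{-n \tilde\varepsilon_n^2 (c_2 + 4)}, \tag{4}$$

$$\Pi\left( z^{(w)} \in \Theta : KL(\lambda_0, z^{(w)}) \le \tilde\varepsilon_n^2,\ V(\lambda_0, z^{(w)}) \le \tilde\varepsilon_n^2 \right) \ge c_4 e^{-c_2 n \tilde\varepsilon_n^2}. \tag{5}$$

Then, for $\varepsilon_n = \max(\bar\varepsilon_n, \tilde\varepsilon_n)$ and a large enough constant $M > 0$, we have that

$$\Pi\left( z^{(w)} \in \Theta : \rho(\lambda_0, z^{(w)}) \ge M \varepsilon_n \mid X^{(n)} \right) \to 0 \tag{6}$$

in $P_{\lambda_0}^{(n)}$-probability.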

Remark 2. Note that the posterior contraction rate $\varepsilon_n$ from Theorem 1 is not
uniquely defined. If $\varepsilon_n$ is a posterior contraction rate, then so is, for instance,
$(3 - \sin(n))\varepsilon_n$, or in fact any sequence that converges to zero at a slower rate
than $\varepsilon_n$. In general we are interested in finding the 'fastest' posterior contraction
rate $\varepsilon_n$, in the sense that (6) holds for this $\varepsilon_n$ and there is no other sequence $\varepsilon'_n \to 0$,
such that $\lim_{n \to \infty} \varepsilon'_n / \varepsilon_n = 0$, for which (6) still holds with $\varepsilon_n$ replaced by $\varepsilon'_n$ (and
perhaps the constant $M$ replaced by another constant $M'$). □

The conditions of Theorem 1 merit some discussion. We restrict ourselves to
heuristic reasoning only: an in-depth discussion can be found in Ghosal et al.
(2000). The important conditions of the theorem are (3) and (5). Since the covering
number can be thought of as measuring the size of the model, condition (3) says
that in order to have posterior contraction rate $\varepsilon_n$, the model should not be too big.
Furthermore, condition (5) tells us that in order to have the posterior contraction
rate $\varepsilon_n$, the prior $\Pi$ should put some minimal mass in the Kullback-Leibler type
neighbourhoods of $\lambda_0$. Finally, condition (4) adds some additional flexibility: for
our purposes it is enough to understand it in the sense that the sets $\Theta_n$ are almost
the support of the prior. This condition often allows one to avoid too stringent
assumptions on the parameter set $\Theta$, such as, for instance, its compactness.

Next we need to find effective means for checking the fact that our model and the
prior satisfy the conditions of Theorem 1. To that end we will employ Theorem 2.1
from van der Vaart and van Zanten (2008a). The following concept is needed in its
statement: for a function $\tilde\lambda_0 : \mathcal{X} \to \mathbb{R}$ define the function $\phi_{\tilde\lambda_0} : \mathbb{R}_+ \to \mathbb{R}$ through

$$\phi_{\tilde\lambda_0}(\varepsilon) = \inf_{h \in \mathbb{H} : \|h - \tilde\lambda_0\|_\infty < \varepsilon} \frac{1}{2} \|h\|_{\mathbb{H}}^2 - \log P(\|W\|_\infty < \varepsilon). \tag{7}$$

This is called the concentration function of the Gaussian process $W$.



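Theorem 2 (van der Vaart and van Zanten (2008a)). Let $\tilde\lambda_0$ be contained in the
support of $W$. For any sequence of positive numbers $\hat\varepsilon_n > 0$ satisfying

$$\phi_{\tilde\lambda_0}(\hat\varepsilon_n) \le n \hat\varepsilon_n^2 \tag{8}$$

and any constant $C > 1$ with $\exp(-C n \hat\varepsilon_n^2) < 1/2$, there exist measurable sets $B_n \subset \mathbb{B}$,
such that

$$\log N(3\hat\varepsilon_n, B_n, \|\cdot\|_\infty) \le 6 C n \hat\varepsilon_n^2, \tag{9}$$

$$P(W \notin B_n) \le e^{-C n \hat\varepsilon_n^2}, \tag{10}$$

$$P(\|W - \tilde\lambda_0\|_\infty < 2\hat\varepsilon_n) \ge e^{-n \hat\varepsilon_n^2}. \tag{11}$$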

Comparing the three conditions (3)-(5) from Theorem 1 to the three conditions
(9)-(11) from Theorem 2, we see that they are of a similar type. Once we bridge the
Hellinger distance, the Kullback-Leibler divergence and the divergence $V$ appearing
in Theorem 1 with the $\|\cdot\|_\infty$-distance, Theorems 1 and 2 will yield the posterior
contraction rate.

The following lemma serves the purpose of bounding the divergences appearing
in Theorem 1. Its proof is found in Appendix A.

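Lemma 1. Let $\lambda_1(x) = \kappa + |w_x|$ and $\lambda_2(x) = \kappa + |v_x|$ for $w, v \in \ell^\infty(\mathcal{X})$. Then

(i) $h(P_{\lambda_1}, P_{\lambda_2}) \le \frac{1}{\sqrt{\kappa}} \|w - v\|_\infty$;
(ii) $KL(P_{\lambda_1}, P_{\lambda_2}) \le \frac{1}{\kappa} \|w - v\|_\infty^2$;
(iii) $V(P_{\lambda_1}, P_{\lambda_2}) \le \frac{1}{\kappa} \|w - v\|_\infty^2 \left( 1 + \frac{1}{\kappa} \|w - v\|_\infty \right)$.
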
The following is our main result. 

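Theorem 3. Let $\lambda_0 = \kappa + \tilde\lambda_0$ for $\tilde\lambda_0 \ge 0$ that is contained in the support of
$W$. Suppose the prior $\Pi$ is the law of the process $Z^{(W)} = (Z_x^{(W)})_{x \in \mathcal{X}}$ for
$Z_x^{(W)} = \kappa + |W_x|$. Then for a sequence $\varepsilon_n = \hat\varepsilon_n$ satisfying the assumptions of Theorem 2
and a sufficiently large constant $M > 0$, the posterior distribution for $\lambda_0$ relative to
the prior $\Pi$ satisfies

$$\Pi\left( z^{(w)} \in \Theta : \rho(\lambda_0, z^{(w)}) \ge M \varepsilon_n \mid X^{(n)} \right) \to 0$$

in $P_{\lambda_0}^{(n)}$-probability.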
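Proof. For $B_n$ as in Theorem 2 set $\Theta_n = \{z^{(w)} : w \in B_n\}$. We need to verify the
conditions of Theorem 1. Denote $c_\kappa = (1/\kappa + 1/\kappa^2)$ and let a constant $C > 1$ from
Theorem 2 be large enough, so that

$$\frac{1}{4 c_\kappa} < \frac{C}{4 c_\kappa} - 4.$$

Let $\hat\varepsilon_n$ be a sequence of positive numbers satisfying the conditions of Theorem 2.
Take $\bar\varepsilon_n = 3 \kappa^{-1/2} \hat\varepsilon_n$. By Lemma 1 (i) and by inequality (9),

$$\log N(\bar\varepsilon_n, \Theta_n, \rho) \le \log N(3\hat\varepsilon_n, B_n, \|\cdot\|_\infty) \le 6 C n \hat\varepsilon_n^2 = \frac{2 \kappa C}{3} n \bar\varepsilon_n^2,$$

which verifies (3) for the constant $c_1 = 2 \kappa C / 3$. Furthermore, for $n$ large enough, so
that $\hat\varepsilon_n$ is small, Lemma 1 (ii)-(iii) yields that

$$\left\{ z^{(w)} \in \Theta : KL(\lambda_0, z^{(w)}) \le \tilde\varepsilon_n^2,\ V(\lambda_0, z^{(w)}) \le \tilde\varepsilon_n^2 \right\} \supset \left\{ z^{(w)} : c_\kappa \|w - \tilde\lambda_0\|_\infty^2 \le \tilde\varepsilon_n^2 \right\}.$$

Set $\tilde\varepsilon_n = 2 \sqrt{c_\kappa}\, \hat\varepsilon_n$. It follows from the above display and (11) that

$$\Pi\left( z^{(w)} \in \Theta : KL(\lambda_0, z^{(w)}) \le \tilde\varepsilon_n^2,\ V(\lambda_0, z^{(w)}) \le \tilde\varepsilon_n^2 \right) \ge P(\|W - \tilde\lambda_0\|_\infty < 2\hat\varepsilon_n) \ge e^{-n \hat\varepsilon_n^2} = \exp\left( -\frac{1}{4 c_\kappa} n \tilde\varepsilon_n^2 \right).$$

This verifies (5) for $c_4 = 1$ and $c_2 \ge 1/(4 c_\kappa)$. Finally, by (10),

$$\Pi(\Theta \setminus \Theta_n) \le P(W \notin B_n) \le e^{-C n \hat\varepsilon_n^2} = \exp\left( -\frac{C}{4 c_\kappa} n \tilde\varepsilon_n^2 \right).$$

This verifies (4) for $c_3 = 1$ and $c_2 \le C/(4 c_\kappa) - 4$. Theorem 1 then yields the
posterior contraction rate $\varepsilon_n = \max(\bar\varepsilon_n, \tilde\varepsilon_n)$. Since both $\bar\varepsilon_n$ and $\tilde\varepsilon_n$ are proportional
to $\hat\varepsilon_n$, we can simply take $\varepsilon_n = \hat\varepsilon_n$ and absorb the constants in the constant $M$ in
the statement of Theorem 1. This completes the proof. □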

Remark 3. Due to boundary bias problems characteristic of kernel-type estimators,
in Kutoyants (1998) the properties of a kernel estimator of $\lambda_0(x)$ are studied only
for $x$ restricted to a compact set strictly contained in $\mathcal{X}$. On the other hand, our
non-parametric Bayesian approach does not suffer from this limitation (the
pseudo-distance $\rho(\lambda_1, \lambda_2)$ is a global distance using all the values of $\lambda_1$ and $\lambda_2$ on $\mathcal{X}$). □

Remark 4. Motivated by applications of the so-called log-Gaussian Cox processes
(see e.g. Chapter 6 in Kingman (1993) for more information on Cox processes), one
could have argued that a reasonable prior $\Pi$ for $\lambda_0$ would have been the process
$Z^{(W)} = (Z_x^{(W)})_{x \in \mathcal{X}}$ defined through

$$Z_x^{(W)} = \exp(W_x), \quad x \in \mathcal{X}.$$

This transforms $W$ into a strictly positive process $Z^{(W)}$. Moreover, if the sample
paths of $W$ are, say, $\beta$-Hölder-regular, so will be the sample paths of $Z^{(W)}$. However,
examination of the proof of Lemma 1 shows that in this case there does not seem to
exist a good way to control the probability divergences in the statement of Theorem 1
in terms of the $\|\cdot\|_\infty$-distance. This then does not permit one to invoke Theorem 2
in order to derive the posterior contraction rate. We suspect that for such a prior
the posterior contracts at a suboptimal rate (in the sense that there exists some other
prior, for which the posterior contracts at a faster rate $\varepsilon'_n$; cf. Remark 2). On a
practical level, an objection that can be advanced against such a prior for $\lambda_0$ is that
the process $Z^{(W)}$ grows too 'fast', since so does the exponential function. □

Remark 5. An interesting statistical problem related to the one we are considering
in this note is non-parametric estimation of the intensity function of a cyclic Poisson
point process over $\mathcal{X} = [0, T]^d$ (i.e. a Poisson point process with a periodic
intensity function). A recent reference dealing with estimation of the unknown period
in this model is Belitser et al. (2012). □



3. Example of the prior 

In this section we consider a concrete example of the prior and compute the
posterior contraction rate for it explicitly. Let for simplicity $d = 1$. We recall the
definition of a $\beta$-Hölder-regular function: a function $\lambda : \mathcal{X} \to \mathbb{R}$ is said to be
$\beta$-Hölder-regular for $\beta > 0$, if it is continuously differentiable up to order $\lfloor \beta \rfloor$ (here
$\lfloor \beta \rfloor$ denotes the largest integer strictly smaller than $\beta$; for $\lfloor \beta \rfloor = 0$ we assume
that $\lambda$ is continuous) and the derivative $\lambda^{(\lfloor \beta \rfloor)}$ satisfies the Hölder condition of
order $\beta - \lfloor \beta \rfloor$. We will denote the space of $\beta$-Hölder-regular functions by $C^\beta(\mathcal{X})$.
Furthermore, $C(\mathcal{X})$ will denote the space of continuous functions on $\mathcal{X}$ equipped
with the uniform norm.

Example 1. Let $\widetilde{W} = (\widetilde{W}_x)_{x \in \mathcal{X}}$ be a standard Brownian motion over the time
interval $\mathcal{X} = [0, 1]$ and let $\eta_0, \eta_1, \ldots, \eta_{\lfloor \beta \rfloor + 1}$ be standard normal random variables.
Assume that $\eta_0, \eta_1, \ldots, \eta_{\lfloor \beta \rfloor + 1}, \widetilde{W}$ are independent. The modified Riemann-Liouville
process $W = (W_x)_{x \in \mathcal{X}}$ with Hurst parameter $\beta > 0$ is defined as

$$W_x = \sum_{k=0}^{\lfloor \beta \rfloor + 1} \eta_k x^k + \int_0^x (x - y)^{\beta - 1/2}\, d\widetilde{W}_y, \quad x \in \mathcal{X},$$

see Section 4.2 in van der Vaart and van Zanten (2008a). Our prior $\Pi$ will be
the law of the process $Z^{(W)} = (Z_x^{(W)})_{x \in \mathcal{X}}$ defined by (2). By Theorem 4.3 in
van der Vaart and van Zanten (2008a), the support of $W$ is the whole space $C(\mathcal{X})$,
and if $\lambda_0 = \kappa + \tilde\lambda_0$ for a non-negative $\tilde\lambda_0 \in C^\beta(\mathcal{X})$, then
$\phi_{\tilde\lambda_0}(\varepsilon) \lesssim \varepsilon^{-1/\beta}$ as $\varepsilon \downarrow 0$. It
then follows from Theorem 3 by solving inequality (8) (cf. van der Vaart and van Zanten
(2008a), pp. 1449-1450) that the posterior contracts at the rate $n^{-\beta/(2\beta+1)}$. This
is the minimax estimation rate for a $\beta$-Hölder-regular function in a variety of
non-parametric estimation problems. See in particular Theorem 6.5 in Kutoyants (1998)
for the Poisson point processes setting. The rate $n^{-\beta/(2\beta+1)}$ can thus be thought
of as an optimal posterior contraction rate in this particular setting. □
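
The step of solving inequality (8) can be checked numerically. The sketch below assumes, for illustration only, that the concentration function equals $\varepsilon^{-1/\beta}$ exactly (constants are ignored), finds the smallest $\varepsilon$ with $\varepsilon^{-1/\beta} \le n \varepsilon^2$ by bisection, and compares it with $n^{-\beta/(2\beta+1)}$. The function name `solve_rate` and the chosen values of $n$ and $\beta$ are hypothetical.

```python
def solve_rate(n, beta, iterations=200):
    """Smallest eps in (0, 1] with eps**(-1/beta) <= n * eps**2.

    Assumes the concentration function is exactly eps**(-1/beta).
    """
    lo, hi = 1e-12, 1.0
    for _ in range(iterations):
        mid = (lo * hi) ** 0.5  # bisection on the logarithmic scale
        if mid ** (-1.0 / beta) > n * mid ** 2:
            lo = mid  # inequality (8) fails at mid: the rate must be larger
        else:
            hi = mid  # inequality (8) holds at mid: try a smaller value
    return hi


n, beta = 1000, 2.0
solved = solve_rate(n, beta)
closed_form = n ** (-beta / (2.0 * beta + 1.0))
print(solved, closed_form)
```

Both numbers agree up to floating-point error, since $\varepsilon^{-1/\beta} = n \varepsilon^2$ is equivalent to $\varepsilon^{(2\beta+1)/\beta} = 1/n$.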

Appendix A. 
Proof of Lemma 1. Part (i) follows from part (ii) and the well-known inequality

$$h^2(P_{\lambda_1}, P_{\lambda_2}) \le KL(P_{\lambda_1}, P_{\lambda_2})$$

between the squared Hellinger distance and the Kullback-Leibler divergence
(alternatively, see Lemma 1.5 in Kutoyants (1998)).






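We prove part (ii). Using Theorem 1.3 and Lemma 1.1 from Kutoyants (1998),
we have

$$KL(P_{\lambda_1}, P_{\lambda_2}) = \int_{\mathcal{X}} \lambda_1(x) \log\left( \frac{\lambda_1(x)}{\lambda_2(x)} \right) dx - \int_{\mathcal{X}} \left( \frac{\lambda_1(x)}{\lambda_2(x)} - 1 \right) \lambda_2(x)\, dx. \tag{12}$$

Now since $\log(1 + x) \le x$ for $x > -1$, we get that

$$KL(P_{\lambda_1}, P_{\lambda_2}) \le \int_{\mathcal{X}} [\lambda_1(x) - \lambda_2(x)]^2 \frac{1}{\lambda_2(x)}\, dx \le \frac{1}{\kappa} \int_{\mathcal{X}} [\lambda_1(x) - \lambda_2(x)]^2\, dx \le \frac{1}{\kappa} \|w - v\|_\infty^2,$$

where the last inequality follows from the inequality $||a| - |b|| \le |a - b|$ valid for
$a, b \in \mathbb{R}$. This proves part (ii). Here we also see the role of the constant $\kappa > 0$.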

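We prove part (iii). Letting $U \sim P_{\lambda_1}$ and denoting by $E_{\lambda_1}[\cdot]$ the expectation
under $P_{\lambda_1}$, we have

$$V(P_{\lambda_1}, P_{\lambda_2}) = E_{\lambda_1}\left[ \log^2\left( \frac{dP_{\lambda_2}}{dP_{\lambda_1}}(U) \right) \right] - KL^2(P_{\lambda_1}, P_{\lambda_2}). \tag{13}$$

Using Theorem 1.3 and Lemma 1.1 from Kutoyants (1998), as well as formula (12)
above, after some uninspiring computations we get from (13) that

$$V(P_{\lambda_1}, P_{\lambda_2}) = \int_{\mathcal{X}} \lambda_1(x) \log^2\left( \frac{\lambda_1(x)}{\lambda_2(x)} \right) dx = \int_{\lambda_1 < \lambda_2} \lambda_1(x) \log^2\left( \frac{\lambda_1(x)}{\lambda_2(x)} \right) dx + \int_{\lambda_1 \ge \lambda_2} \lambda_1(x) \log^2\left( \frac{\lambda_1(x)}{\lambda_2(x)} \right) dx = I_1 + I_2,$$

with an obvious definition of $I_1$ and $I_2$. Recall the elementary inequality

$$\frac{x}{1 + x} \le \log(1 + x) \le x, \quad x > -1.$$

This inequality gives that on the set $\{\lambda_1 < \lambda_2\}$,

$$\log^2\left( \frac{\lambda_1(x)}{\lambda_2(x)} \right) \le \frac{1}{\lambda_1^2(x)} [\lambda_1(x) - \lambda_2(x)]^2.$$

Hence

$$I_1 \le \frac{1}{\kappa} \|\lambda_1 - \lambda_2\|_\infty^2 \le \frac{1}{\kappa} \|w - v\|_\infty^2.$$

On the other hand, on the set $\{\lambda_1 \ge \lambda_2\}$,

$$0 \le \log\left( \frac{\lambda_1(x)}{\lambda_2(x)} \right) \le \frac{\lambda_1(x) - \lambda_2(x)}{\lambda_2(x)}.$$

Therefore,

$$I_2 \le \int_{\lambda_1 \ge \lambda_2} [\lambda_1(x) - \lambda_2(x)]^2 \frac{\lambda_1(x)}{\lambda_2^2(x)}\, dx = \int_{\lambda_1 \ge \lambda_2} [\lambda_1(x) - \lambda_2(x)]^3 \frac{1}{\lambda_2^2(x)}\, dx + \int_{\lambda_1 \ge \lambda_2} [\lambda_1(x) - \lambda_2(x)]^2 \frac{1}{\lambda_2(x)}\, dx$$

$$\le \frac{1}{\kappa^2} \|\lambda_1 - \lambda_2\|_\infty^3 + \frac{1}{\kappa} \|\lambda_1 - \lambda_2\|_\infty^2 \le \frac{1}{\kappa} \|w - v\|_\infty^2 \left( 1 + \frac{1}{\kappa} \|w - v\|_\infty \right).$$

The integration domains of $I_1$ and $I_2$ are disjoint subsets of $\mathcal{X}$, which has Lebesgue
measure one, and the integrands of both $I_1$ and $I_2$ are bounded pointwise by the
right-hand side of the last display, so the same bound holds for $I_1 + I_2$.
This completes the proof of part (iii) and hence of the lemma too. □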
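
As a numerical sanity check of parts (ii) and (iii) of the lemma, the sketch below approximates formula (12) and the integral expression for $V$ obtained from (13) by a midpoint rule for $d = 1$, and compares them with the stated bounds. The functions `w` and `v`, the value of $\kappa$ and the name `check_lemma_bounds` are hypothetical choices for illustration.

```python
import math


def check_lemma_bounds(w, v, kappa, grid_size=2000):
    """Midpoint-rule approximations of KL and V on [0, 1], with their bounds."""
    h = 1.0 / grid_size
    sup_dist = kl = var = 0.0
    for k in range(grid_size):
        x = (k + 0.5) * h
        lam1 = kappa + abs(w(x))
        lam2 = kappa + abs(v(x))
        # Sup-norm distance between w and v, evaluated on the grid.
        sup_dist = max(sup_dist, abs(w(x) - v(x)))
        log_ratio = math.log(lam1 / lam2)
        kl += (lam1 * log_ratio - (lam1 - lam2)) * h  # formula (12)
        var += lam1 * log_ratio ** 2 * h              # integral form of V
    kl_bound = sup_dist ** 2 / kappa                  # bound in part (ii)
    var_bound = kl_bound * (1.0 + sup_dist / kappa)   # bound in part (iii)
    return kl, kl_bound, var, var_bound


kl, kl_bound, var, var_bound = check_lemma_bounds(
    w=lambda x: math.sin(2.0 * math.pi * x),
    v=lambda x: 0.5 * math.cos(3.0 * x),
    kappa=0.5,
)
print(kl, kl_bound, var, var_bound)
```

Because the bounds in the proof hold pointwise for the integrands, the discretised quantities satisfy the same inequalities on the grid.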